Building a best-in-class automated de-identification tool for electronic health records through ensemble learning
Authors
Abstract
• An ensemble approach to automated de-identification of unstructured clinical text
• Our approach leverages advances in deep learning along with heuristics
• Detected personally identifiable information is replaced with suitable surrogates
• Patient data are de-identified at scale to accelerate medical discovery
Clinical notes in electronic health records convey rich historical information regarding disease and treatment progression. However, this unstructured text often contains personally identifiable information such as names, phone numbers, or residential addresses of patients, thereby limiting its dissemination for research purposes. The removal of patient identifiers, through the process of de-identification, enables sharing of clinical data while preserving patient privacy. Here, we present a best-in-class approach to de-identification, which automatically detects identifiers and substitutes them with fabricated ones. Our approach enables de-identification of patient data at the scale required to harness the unstructured, context-rich information in electronic health records to aid in medical research and advancement.
The presence of personally identifiable information (PII) in natural language portions of electronic health records (EHRs) constrains their broad reuse. Despite continuous improvements in automated detection of PII, residual identifiers require manual validation and correction. Here, we describe an automated de-identification system that employs an ensemble architecture, incorporating attention-based deep-learning models and rule-based methods, supported by heuristics for detecting PII in EHR data. Detected identifiers are then transformed into plausible, though fictional, surrogates to further obfuscate any leaked identifier. Our approach outperforms existing tools, with a recall of 0.992 and precision of 0.979 on the i2b2 2014 dataset and a recall of 0.994 and precision of 0.967 on a dataset of 10,000 notes from the Mayo Clinic. The de-identification system presented here enables the generation of de-identified patient data at the scale required for modern machine-learning applications to help accelerate medical discoveries.
The widespread adoption of electronic health records by health care systems has enabled the digitization of patient health journeys. While the structured elements of EHRs (e.g., health insurance billing codes) have been relied upon to support business and front office applications for decades, the unstructured text (e.g., history and physical notes and pathology reports) contains far richer and nuanced information about patient care, supporting novel research.1Wagner T. Shweta F. Murugadoss K. Awasthi S. Venkatakrishnan A.J. Bade Puranik A. Kang M. Pickering B.W. O’Horo J.C. et al.Augmented curation massive reveals symptoms impending COVID-19 diagnosis.Elife. 
2020; 9: e58227Crossref PubMed Scopus (27) Google Scholar, 2Iqbal E. Mallah R. Rhodes D. Wu H. Romero Chang N. Dzahini O. Pandey C. Broadbent Stewart al.ADEPt, semantically-enriched pipeline extracting adverse drug events free-text records.PLoS One. 2017; 12: e0187121Crossref (16) 3Jung LePendu P. Chen W.S. Iyer S.V. Readhead B. Dudley J.T. Shah N.H. Automated off-label use.PLoS 2014; e89324Crossref (42) 4Afzal Sohn Scott C.G. Liu Kullo I.J. Arruda-Olson A.M. Surveillance Peripheral Arterial Disease cases using processing notes.AMIA Jt. Summits Transl Sci. Proc. 2017: 28-36PubMed 5Finlayson S.G. Building graph medicine millions narratives.Sci. Data. 1: 140032Crossref Scholar defined Health Insurance Portability Accountability Act 1996 (HIPAA), personal name, number, address.6Office Civil Rights H.H.S. Standards privacy individually information. Final rule.Fed. Regist. 2002; 67: 53181-53273PubMed As consequence, limited reuse secondary purposes.7Berg, H., Henriksson, A., Dalianis, (2020). Impact De-identification Downstream Named Entity Recognition Text. Proceedings 11th International Workshop Text Mining Information Analysis.Google HIPAA permits derived be widely shared used when it de-identified. Under Privacy Rule, can accomplished several ways. most straightforward Safe Harbor implementation, necessitates enumerated list 18 categories direct Social Security number) quasi-identifiers date service). Implementing scalable method competing requirements. First, regulatory perspective, must achieve extremely high recall, needs detect nearly all instances PII. Second, utility precision, so maximize correctness biomedical performed. And, third, cost effective, reasonable amount time. traditional expensive, time consuming, prone human error,8Neamatullah I. Douglass M.M. Lehman L.-W.H. Reisner Villarroel Long W.J. Szolovits Moody G.B. Mark R.G. Clifford G.D. records.BMC Med. Inform. Decis. Mak. 
2008; 8: 32Crossref (206) Scholar,9Douglass Computer-assisted free MIMIC II database.Computers Cardiology. 2004; : 341-344https://doi.org/10.1109/CIC.2004.1442942Crossref makes more promising alternative.10Leevy J.L. Khoshgoftaar T.M. Villanustre Survey RNN CRF text.J. Big 7: 73Crossref (5) Scholar,11Yogarajan V. Pfahringer A review automatic end-to-end de-identification: accuracy only metric?.Appl. Artif. Intelligence. 34: 251-269Crossref Several recent (NLP) created opportunity build accurate systems. transfer autoregressive autoencoder models12Yang Z. Dai Yang Y. Carbonell J. Salakhutdinov R.R. Le Q. XLNet: generalized pretraining understanding.in: Advances Neural Processing Systems. arXiv, 2019Google supervised task named entity recognition (NER) requires very few labeled data, reducing effort error. models, transformers,13Vaswani Shazeer Parmar Uszkoreit Jones L. Gomez A.N. Kaiser Polosukhin Attention you need.in: 2017Google allow non-sequential enable contextualized word representations. Third, semantic segmentation algorithms generate subword-based vocabulary,14Sennrich, R., Haddow, B., Birch, (2016). Machine Translation Rare Words Subword Units. 54th Annual Meeting Association Computational Linguistics (Volume Papers).Google Scholar,15Kudo, T., Richardson, (2018). SentencePiece: simple independent subword tokenizer detokenizer Processing. 2018 Conference Empirical Methods Natural Language Processing: System Demonstrations.Google capture out-of-vocabulary words. Finally, transformer architecture improved bidirectional encoder representations transformers (BERT)16Devlin Lee Toutanova BERT: Pre-training Deep Bidirectional Transformers Understanding. Linguistics), 2019: 4171-4186Google similar technologies jointly train masked model (MLM) pre-training objective next sentence prediction task. BERT set stage context-independent terms training context-sensitive transform those context-aware based occurrence term sentence. We leverage these formulate NER problem. 
In paper, integrate collection approaches, blending beneficial aspects rules heuristics, create de-identification. transforms each detected instance surrogate mitigate risk re-identify patients (Figure 1). nference tool accessed https://academia.nferx.com/deid/. first compare performance other methods dataset.17Stubbs Uzuner Ö. Annotating longitudinal narratives i2b2/UTHealth corpus.J. Biomed. 2015; 58: S20-S29Crossref (85) resulting evaluated F1 scores (formulation provided supplemental methods) groups Table 1. substantially larger diverse Clinic perform deeper dive types errors, distribution errors per physician note, note type. It should noted analysis focuses solely does not address re-identification semantics fails detect, issue beyond scope study.Table 1The entities covered group quasi-identifiersGroup nameIncluded entitiesA (entities implementation)(1) age over 89, (2) phone/fax (3) email addresses, (4) websites URLs, IP (6) dates, (7) (8) record (9) vehicle/device (10) account/certificate/license (11) plan (12) street address, (13) city, (14) ZIP code, (15) employer names family membersBGroup (17) provider (doctor/nurse) (18) user IDs (of providers)CGroup B (19) organization/facility (20) country, (21) stateIt C encompass Harbor. Open table new tab Heart Risk Factors challenge17Stubbs publicly available documents annotated elements. This consists 792 test 515 notes. compared our six established tools: proposed Dernoncourt al. blends conditional random fields (CRFs) artificial neural networks (ANNs),18Dernoncourt J.Y. recurrent networks.J. Am. Assoc. 24: 596-606Crossref (132) Scrubber,19McMurry Fitch Savova G. Kohane I.S. Reis B.Y. Improved integrative modeling both public private text.BMC 2013; 13: 112Crossref Physionet,8Neamatullah Philter,20Norgeot Muenzen Peterson T.A. Fan X. Glicksberg B.S. 
Schenk Rutenberg Oskotsky Sirota Yazdany al.Protected filter (Philter): accurately securely de-identifying notes.NPJ Digit 3: 57Crossref MIST,21Aberdeen Bayer Yeniterzi Wellner Clark Hanauer Malin Hirschman MITRE Identification Scrubber Toolkit: design, training, assessment.Int. 2010; 79: 849-859Crossref (82) NeuroNER.22Dernoncourt NeuroNER: easy-to-use program named-entity networks.arXiv. (1705.05487)Google results 2. cite ANN (CRF + ANN)18Dernoncourt against (HIPAA only) reported paper. also directly report Scrubber, Physionet, Philter prior publications20Norgeot without performing empirical because (2014 i2b2) same investigation. trained MIST sentences corpus (see S3). downloaded pre-trained NeuroNER methods). 1) entities, use basis comparison.Table 2Performance corpusMethod nameGroupPrecisionRecallF1Basis resultsCRF (Dernoncourt al.)A0.9790.9780.978Dernoncourt al.18Dernoncourt ScholarPhysionetB0.8940.6980.784Norgeot al.20Norgeot ScholarScrubberB0.7620.8780.815Norgeot ScholarPhilterB0.7850.999∗Best metric.0.879Norgeot ScholarMIST (trained i2b2)B0.9070.8790.893N/ANeuroNERB0.9790.9500.964N/Anference (fine-tuned Mayo)B0.9610.9880.974N/Anference i2b2)B0.979∗Best metric.0.9920.985∗Best metric.N/AThe Philter, previous publications. and, thus, was dataset. NeuroNER. two versions were fine-tuned (1) datasets.∗ Best metric. datasets. system. version did utilize characteristics When B, achieved score 0.961, 0.988, 0.974, respectively. second involved fine-tuning set. could incorporate inclusion lists templates associated since small “methods” details). increased 0.979, 0.992, 0.985, Precision identifier type S4. consisted randomly sampled 104 million corresponding 477,000 patients' records. evaluation performed best represented (in F1) 3. best, 0.967, 0.994, Compared dataset, see (increase 0.01) reduced value (decrease 0.021). achieves 0.928, 0.933, 0.931, lower than Among three demonstrates relatively 0.918. 
Closely following 0.889 overall dataset.Table 3Performance datasetMethodPrecisionRecallF1Scrubber0.7560.6770.715Philter0.7090.9180.800Physionet0.8370.7720.803MIST Mayo)0.8180.8890.852NeuroNER Mayo)0.9280.9330.931nference Mayo)0.967∗Best metric.0.994∗Best metric.0.979∗Best metric.These entities.∗ These entities. investigated where failed successfully element completely (i.e., false negatives). occurred rate 0.6% 4). Across considered set, there 848 error contained negative errors. Accounting duplicate occurrences sentence, 797 unique instances. grouped prevalence category shown column, third column represents contribution (sums 0.6%).Table 4Prevalence examples negatives encountered applied setCategoryNumber (n = 797)Contribution (E 0.6%)Example (the fictitious)Clinic location2080.1461%He had DWI January do Samson rehab St. Louis, MissouriDates1830.1285%CPL dated 4/27/04Doctor/nurse name/initial1690.1187%Sent: 2020-10-20 10:00 a.m.. Subject: RE: Consumer/PatPharmacy name540.0379%S: fax received Trioki Rx request RX Viread (tenofovir)Phone number500.0351%Phone number patient/caller calling provider: 724.161.1754Organization/company350.0246%Last talked her involvement called GO GIRLS!Health organization220.0154%Jane brought Minerva female attendant said Jane like "weeks weeks."Numeric identifier90.0063%Manufactured Merck lot 78-32-DK, expiration 2020/10/20Location (address partial address)80.0056%500 State Highway 72Patient name40.0028%PLOF: X self cares livingThe highlighted italics indicate phrase detect. prevalent pertaining clinic locations (208 797). Many due partially identified phrases “Room 7A” missed “out Southwest Room 7A”). 183 negatives. doctor/nurse initials, 169 Abbreviations shorthand providers (typically signing off note) contributed category. 
Ambiguous resulted reader would difficulty/uncertainty deeming An example “tp” “Comment: 03-12-2005 08:04:12—verified tp.” found 26% nurse abstractors themselves agree characterization (Cohen's κ non-errors, 0.7453), pointing inherent ambiguity. (false negatives) per-note level. 5, distributed across 637 Furthermore, majority spread evenly (525 notes, 82.4%, contain single error). For subsequent rate, computed coverage fraction subset up rate.Table 5Distribution noteErrors noteNumber notesTotal errorsCumulative errorsPII coverageAverage types09,363000.9940015255255250.99781.002801606850.99891.56310307150.99911.7546247390.99922.3056307690.99942.7562127810.99952.572147950.99962.2582168110.99972.2593278380.99992.33101108481.00002.00PII rate. Average denotes distinct (such name errors) note. Even large (more 6), between 2 illustrates artifact repetition within example, 10 eight related location, remaining date. Examples location “Location INR sample: Other: Smallville Smallville”, “Recommend Recheck: 04/01/2017 Smallville”. pertain “Smallville,” how effective content smaller suggested raw count. (“04/01/2017”) detected. Both synthetic values purpose example. progress emergency visit, telephone encounter). Given structure vary greatly one another, analyze enrichment them. From 134 least 1 top 14 highest listed 6. Notes “Anti-coag service visit summary” (22 26 notes), followed “Electrocardiogram” (19 30 notes).Table 6Distribution typeNote typeNo. instancesNo. errorTotal No. 
notesFraction errorPhone message/call607,466546050.09Ambulatory summary5914,502493340.15Physician office/clinic message428,352366610.05Report503,173361310.27Medication renewal/refill364,626313580.09Progress practice274,975242370.10Ambulatory discharge medication list278,109232260.10Anti-coag summary241,18922260.85Electrocardiogram1941119300.63Anticoagulation intake—text495,77718500.36Letter153,519141570.09Ambulatory depart summary123,938121630.07Progress notes143,943111990.06Telephone encounter122,034112730.04The proportion given last column. indicates likely occur. originated multiple (including Epic Cerner) spanning 20 years. includes journey addition tables containing lab measurements, diagnosis information, orders, administration conducted approval Institutional Review Board. sentences. yielded 172,102 subsequently ground truth label every and/or phrase. Each different abstractors. interannotator agreement labeling token Cohen's 0.9694 S5 additional selected fine-tune models. manually 61,800 tagged See details. described section state-of-the-art conjunction harvested (each below) handle semi-structured 2). There salient features worth noting. Sentence-based template matching prune out either lack specific well-defined patterns. identifies complementary P
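The precision, recall, and F1 figures quoted above are related by the standard harmonic-mean definition of F1. As a minimal sketch (the function name f1_score is illustrative and not taken from the paper, whose own formulation is given in its supplemental methods), the values reported for the i2b2 2014 corpus can be cross-checked as follows:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing is detected
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs as reported for the i2b2 2014 corpus.
reported = {
    "fine-tuned on i2b2": (0.979, 0.992),
    "fine-tuned on Mayo": (0.961, 0.988),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_score(p, r):.3f}")
```

For these two rows the recomputed values agree with the reported F1 scores of 0.985 and 0.974. Because the inputs are already rounded to three decimals, recomputing other rows this way may differ from the published figure in the last digit.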
Journal
Journal title: Patterns
Year: 2021
ISSN: ['2666-3899']
DOI: https://doi.org/10.1016/j.patter.2021.100255